[57] ABSTRACT

In a partitioned database system of the Shared Nothing 
type, one or more secondary replicas of each partition 
are maintained by spooling (i.e., asynchronously send- 
ing) modified (usually called dirty) pages from the pri- 
mary replica to the secondary replica(s) rather than by 
using a synchronous page update or by sending log 
entries instead of entire pages. A Write-Ahead Log 
protocol is used so that a dirty page is not forced to 
non-volatile storage until a log record of the modifica- 
tion is created and written to non-volatile storage. Rep- 
lica updating does not delay the committing of transac- 
tions because replica updating is done asynchronously 
with respect to transaction processing. Since dirty 
pages are sent rather than only log entries, disk accesses 
and processing at the secondary replica(s) arising from 
the maintaining of the replicas are minimized as well. 
Only one centrally accessible log is maintained for all 
replicas of the same partition. 

21 Claims, 3 Drawing Sheets 
ASYNCHRONOUS REPLICA MANAGEMENT IN
SHARED NOTHING ARCHITECTURES

This is a continuation of application Ser. No.
07/809,354 filed Dec. 18, 1991 now abandoned.

FIELD OF THE INVENTION

This invention relates to multiple processor Shared
Nothing transaction processing computer systems
which use a Write-Ahead Log protocol and more
particularly to such systems in which two or more copies
(usually called replicas) of the same information are
maintained by separate processors in order to provide
failure safety and high availability.

BACKGROUND OF THE INVENTION

FIG. 1 shows a typical Shared Nothing computer
system architecture in the form of a database. In a
database of such an architecture, the database information
is partitioned over loosely coupled multiple processors
20 typically connected by a local area network 10. Each
of the multiple processors 20a-20n typically has its own
private non-volatile storage 30a-30n and its own private
memory 40a-40n. One problem with a Shared Nothing
architecture in which information is distributed over
multiple nodes is that it typically cannot operate very
well if any of the nodes fails because then some of the
distributed information is not available anymore.
Transactions which need to access data at a failed node
cannot proceed. If database relations are partitioned across
all nodes, almost no transaction can proceed when a
node has failed.

The likelihood of a node failure increases with the
number of nodes. Furthermore, there are a number of
different types of failures which can result in failure of
a single node. For example:

(a) A processor could fail at a node;

(b) A non-volatile storage device or controller for
such a device could fail at a node;

(c) A software crash could occur at a node; or

(d) A communication failure could occur resulting in
all other nodes losing communication with a node.

In order to provide high availability (i.e., continued
operation) even in the presence of a node failure,
information is commonly replicated at more than one node,
so that in the event of a failure of a node, the information
stored at that failed node can be obtained instead at
another node which has not failed. The multiple copies
of information are usually called replicas, one of which
is usually considered the primary replica and the one or
more other copies considered the secondary replica(s).
The maintenance of replicas always involves an
added workload for the computer system. This invention
specifically relates to the problem of maintaining
replicas in a more efficient manner.

DESCRIPTION OF THE PRIOR ART

A general review of storage media recovery
techniques has been published by G. Copeland and T. Keller
in "A Comparison of High-Availability Media Recovery
Techniques," Proceedings of the ACM-SIGMOD
International Conference on Management of Data,
Portland, Oreg., June 1989.

The Shared Nothing architecture is described by M.
Stonebraker in "The Case For Shared Nothing,"
Database Engineering, Vol. 9, No. 1, 1986.

Prior art generally relating to recovery from failures
in a single site database system is surveyed by T. Haerder
and A. Reuter in "Principles of Transaction-Oriented
Database Recovery", ACM Computing Surveys,
September 1983.

One type of recovery method, which relies upon the
creation of a log of changes to database pages, is called
Write-Ahead Logging. A Write-Ahead Logging
method that has been named ARIES is described by C.
Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P.
Schwarz in "ARIES: A Transaction Recovery Method
Supporting Fine-Granularity Locking and Partial
Rollbacks Using Write-Ahead Logging", RJ 6649, IBM
Almaden Research Center, San Jose, Calif., January
1989. ARIES is also described by C. Mohan, D. Haderle,
B. Lindsay, H. Pirahesh, and P. Schwarz in U.S.
patent application Ser. No. 59,666 filed on Jun. 8, 1987
and entitled "Method for Managing Sub-page
Concurrency Control and Partial Transaction Rollback in a
Transaction-Oriented System of The Write-Ahead
Logging Type". Since the recovery method in the preferred
embodiment of the present invention is based upon the
ARIES method of recovery, these two ARIES
references are hereby fully incorporated by reference.

Replica control protocols are described, for example,
in U.S. Pat. Nos. 4,007,450, 4,432,057, 4,714,992 and
4,714,996. However, this invention is directed instead
towards the problem of keeping secondary replicas
up-to-date and not with replica control. The replica
control method used in accordance with the preferred
embodiment of this invention is the primary copy access
scheme described, for example, by P. A. Bernstein, V.
Hadzilacos, and N. Goodman in "Concurrency Control
and Recovery in Database Systems," Addison Wesley,
1987.

The usual prior art approach for maintaining replicas
is to maintain the secondary replicas always up-to-date
with respect to the primary. Two distinct mechanisms
have been proposed for such "synchronous" updating
of secondary replicas: at the physical level and at the
logical level.

In physical replication, the bits stored on disk of the
primary are copied onto a secondary disk, and there are
no semantics associated with the bits. In contrast,
logical replication attempts to take advantage of database
semantics by duplicating the database state at the
secondary replica.

Physical level replica maintenance is typified by
Tandem's NonStop SQL as described, for example, by the
Tandem Database Group in "NonStop SQL, A
Distributed, High-Performance, High-Reliability
Implementation of SQL," Workshop on High Performance
Transaction Systems, Asilomar, Calif., September 1987, and
by various extensions to RAID (described by D.
Patterson, G. Gibson and R. Katz in "A Case For
Redundant Arrays of Inexpensive Disks (RAID),"
Proceedings of the ACM-SIGMOD International Conference
on Management of Data, Chicago, May 1988), for
example RADD (described by M. Stonebraker and G.
Schloss in "Distributed RAID - A New Multiple
Copy Algorithm," Proceedings of the 6th International
Conference on Data Engineering, Los Angeles,
February 1990). Tandem achieves this by using dual-ported
mirrored disks to achieve high availability, but requires
special purpose hardware. RADD suffers from
excessive communication overhead, since every write
becomes a distributed write.
Synchronous updates based on logical replication are
exemplified by the Teradata DBC/1012 ("DBC/1012 
Database Computer System Manual Release 2.0," Document
No. C10-0001-02, Teradata Corp., November
1985) and by the GAMMA database machine, which is 
described by D. DeWitt, S. Ghandeharizadeh, D. 
Schneider, A. Bricker, H. Hsiao, and R. Rasmussen in
"The Gamma Database Machine Project," IEEE 
Transactions on Knowledge and Data Engineering, 
Vol. 2, No. 1, pp 44-62, March 1990. In Teradata's 
database machine, a transaction is run concurrently on 
the two replicas, which incurs significant CPU and I/O 
overhead in comparison to running a transaction only 
one time. For the GAMMA machine it has been pro- 
posed to send dirty pages to the secondary replica at 
transaction commit time (See "Performance and Avail- 
ability In Database Machines With Replicated Data" by 
Hui-I Hsiao, Computer Sciences Technical Report 
#963, University of Wisconsin, August 1990), which 
incurs significant communication overhead at both pri- 
mary and secondary replicas and possibly large I/O 
overhead at the secondary replica for hot pages. In 
addition, it results in a worsened response time during 
normal operation. 

Another approach has been to update the secondary
replicas independently of (and generally later than) the
primary replica. The advantage of such an "asynchronous"
update is that it imposes less overhead during
failure free operation. Proposals using this mechanism
have all sent the log entries to the secondary replicas 
before committing a transaction. The log entries then 
are used at the secondary replicas to make the same 
modifications to the secondary replicas as were made to 
the primary replica when the log was produced. Such 
an asynchronous update is described by R. King, H.
Garcia-Molina, N. Halim, and C. Polyzois in "Management
of A Remote Backup Copy for Disaster Recovery,"
University of Princeton CS-TR-198-88 and by C.
Mohan, K. Treiber, and R. Obermarck in "Algorithms
For the Management of Remote Backup Data Bases for
Disaster Recovery," IBM Research Report, July 1990.

SUMMARY OF THE INVENTION

It is an object of this invention to more efficiently
provide failure safety and high availability in Shared
Nothing architectures.

It is another object of this invention to more
efficiently maintain replicas in a transaction-oriented
partitioned database system.

A further object is to more efficiently maintain
replicas on separate processing systems using a Write-Ahead
Log based protocol.

It is also an object to maintain replicas without
requiring special purpose hardware.

Another object is to maintain replicas on separate
processing systems while minimizing the workload
thereby added to both systems.

Still another object is to maintain replicas on separate
processing systems without thereby causing any delays
in either such system.

A further object is to update replicas on separate
processing systems without restricting the updates to take
place before or at the transaction commit time.

It is also an object to maintain a secondary replica on
a separate system from the corresponding primary
replica while minimizing required disk accesses and
required processing at the secondary replica.

A still further object is to provide more efficient
recovery of a partitioned database system of the Shared
Nothing type which maintains replicas when a replica
becomes unavailable.

Another object is to minimize the impact of
maintaining replicas in a Shared Nothing architecture through
advantageous use of a high bandwidth interconnection
system.

These and further objects are achieved in accordance
with this invention by maintaining one or more
secondary replicas by spooling (i.e., asynchronously sending)
modified (usually called dirty) pages from the primary
replica to the secondary replica(s) rather than by using
a synchronous update or by sending log entries instead
of entire pages. A Write-Ahead Log protocol is used,
which means that a dirty page is not forced to
non-volatile storage until a log record of the modification is
created and written to non-volatile storage. Since dirty
pages are sent from the primary replica to the secondary
replica(s), disk accesses and processing at the secondary
replica(s) arising from the maintaining of the replicas
are minimized.

Furthermore, since log entries are not needed to
maintain the replicas during failure free operation, in
the preferred embodiment only one centrally accessible
log is maintained for all replicas of the same partition.
The central log may be hardened against failure in any
way, including the use of a backup.

The primary replica maintains a dirty page table
which identifies not only the dirty pages which have
not been stored yet in its own non-volatile storage but
also for each of the secondary replicas (if there is more
than one) identifies the dirty pages which have not yet
been sent to the secondary replica and acknowledged
by the secondary replica as stored in its non-volatile
storage.

If a failure occurs with respect to the primary
replica, the secondary replica (or one of them if more than
one is being maintained) is brought up to date using the
log. It then takes over the role of the primary replica.

When a failed system recovers from the failure, the 
replica(s) it maintains must be brought up to date. This 
may be done from the corresponding primary replica by 
sending dirty pages to the recovering system (assuming 
the corresponding primary replica is available and has 
maintained the needed dirty page table entries). Recovery
may be done alternatively from the log in the event
the corresponding primary replica is not available or
has not maintained the needed dirty page table entries
or because recovery from the log is preferred.

In order to keep track of the various replicas, a node 
manager service is also provided which co-ordinates the 
recovery and ensures that transactions are correctly 
routed to the primary replica. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 shows the organization of a Shared Nothing 
architecture. 

FIG. 2 shows the information flow between a pri- 
mary and a secondary replica. 

FIG. 3 shows the Shared Nothing architecture using 
a hardened log server. 
FIG. 4 shows the dirty page table structure for ARIES.

FIG. 5 shows the transaction table used in ARIES. 
FIG. 6 shows the dirty page table structures used in 
this invention. 
FIG. 7 shows the table maintained by the node
manager.

DESCRIPTION OF THE PREFERRED
EMBODIMENT

The invention is preferably used in connection with a
partitioned database system and will be described now
in further detail in that environment.

Maintaining replicas always involves additional
workload for a computer system. For each data
partition in a partitioned database system which maintains
replicas, there is one primary replica against which
update transactions run. There is at least one and may
be a number of secondary replicas of each partition,
which are updated in accordance with this invention by
sending dirty pages over to the one or more secondary
replicas.

FIG. 2 illustrates a portion of a partitioned database
in which there are two replicas, R1 and R2, for one of
the partitions. Assume that transactions are directed to
R1, the primary replica (of the illustrated partition)
permanently stored on disk 230. Transactions running in
processor 220 use pages 260 in volatile memory 225.
These pages 260 were originally obtained from the
permanent copy of replica R1 on disk 230.

A page brought (i.e., copied) from disk R1 to memory
225 is considered "dirty" when it gets modified by
processor 220 because it is then not the same anymore as
the copy of that page stored on disk 230. When a page
260 gets modified, a log record of the modification is
made and sometime after the log record for an updated
page 260 has been written, the modified page can be
written by processor 220 to replica R1's disk 230. Many
modifications can be made to the page before it is
pushed out (i.e., copied) to disk 230.

In accordance with this invention, at some time after
the log record is written, the dirty page (identified by
reference number 260d in FIG. 2) is spooled over to
processor 240, which is handling replica R2. Dirty
pages can be sent one at a time or as a stream of dirty
pages (SDP) that flow over the communication
network 210 from processor 220 to processor 240. It would
be preferable for the sake of performance to spool a
dirty page to the secondary replica before it is thrown
out of the primary's volatile memory. However, this is
not strictly required. In order to minimize the spooling
overhead, we prefer to send the dirty page to the
secondary when it is pushed out of volatile memory onto
the non-volatile permanent storage.

When replica R2's processor receives the copy 270d
of dirty page 260d, it will force the dirty page to its local
disk 250 either immediately or at some later time, and
then send back an acknowledgment to processor 220.
The broken lines in FIG. 2 illustrate the flows of a page
which becomes dirty in memory 225.

Writes to disk at any replica must follow the
Write-Ahead Log protocol, i.e. a dirty page must not be
written to disk until the log record describing that dirty
page is written out to stable storage. Other than this
condition, the invention provides for complete
asynchrony between the writes at the various replicas.

Because transactions are running against the data
stored at the primary replica only, there is very little
runtime overhead at the other replicas. And because
transactions can commit without waiting for
corresponding dirty pages to be written at secondary
replicas, there is little or no response time delay.

Log data is stored on stable storage such that it
remains accessible to any replica after any single failure.
For this, the simplest arrangement is a hardened log
server, as shown in FIG. 3. In such a scheme, a typical
node 310 sends its log stream 330 over the local area
network 340 to a hardened log server 320 that can
survive single failures. Furthermore, a log record is written
to the server before its corresponding page is sent to the
disk 315 in order to satisfy the write-ahead log protocol.

If a failure occurs on the primary replica, the
secondary replica is brought up-to-date with the primary
replica using the log obtained from the server. It then takes
over the role of the primary and transactions are now
run against it. Since the secondary has to recover from
the log, the log reflects the storage state of the
primary as well as of all the secondaries. This makes the
log symmetrical with respect to the primary and all the
secondaries so that at any time any of the replicas can
recover from the log.

If a failure occurs in a secondary replica, operations
continue on the primary. When the secondary
comes back up, it partially recovers using the log, and
then starts accepting a stream of dirty pages (SDP) from
the primary.

The following details of a preferred implementation
of this invention now will be described:

(a) changes needed to the write-ahead logging
method known as ARIES (which is our preferred
method) in order to keep track of two
replicas rather than a single copy;

(b) the actions by the primary during failure-free
operation to keep the secondary replica up-to-date;

(c) system and node management actions, performed
by a node manager, which include detection of
failure, assignment of primary and secondary roles
and switching of these roles (The node manager
must take consistent actions in the face of lost
messages, network partitions and other errors); and

(d) how recovery is achieved in a number of failure
scenarios.

Before describing these functions in detail, the salient
relevant features of ARIES will be described. ARIES is
described in more detail in the two references cited
earlier.

ARIES Write-Ahead Logging Method

ARIES maintains a dirty page table (DPT) (FIG. 4)
for all pages that have not yet been pushed to the disk.
This DPT contains two fields for each page that is dirty
and in-memory: the page-id field 420, and the RecLSN
field 430 (for Recovery Log Sequence Number, which
is the address of the next log record to be written when
the page was first modified). Whenever a non-dirty
page is updated in memory, a row 440 is added to the
DPT with the corresponding Page-Id and the next LSN
to be assigned (i.e., RecLSN). The RecLSN indicates the
position in the log from which an examination must begin to
discover the log record that describes the changes made
to this page. Whenever a page is written to the disk, the
corresponding entry in the DPT is deleted.
Furthermore, at a checkpoint (CP), the current DPT is written
to the CP log record.

When the database system is recovering from a crash,
it goes through the log from the last checkpoint record
(which contains the DPT at the time of the checkpoint)
to the end, and reconstructs the DPT. In addition, it
determines winner and in-flight transactions. This scan
is called the analysis phase.

In accordance with the ARIES procedure, the log is
then examined in the forward direction (starting at a
position which is called the Minimum RecLSN, which
is the minimum RecLSN of the entries in the
reconstructed DPT, and indicates the earliest change in the
database that was potentially not written to the disk), to
uncover all redoable actions. For each redoable action
for a page that was potentially dirty at the time of the
crash and which passes certain criteria, ARIES redoes
the action.

Finally, in the third pass, ARIES proceeds
backwards from the end of the log, undoing all in-flight
transactions, i.e. transactions that were in progress at
the time of the crash. FIG. 5 shows the transaction table
used by ARIES. In this table, there is a row 560 for each
transaction, containing its transaction identifier 520,
state (committed, aborted or in-flight) 530, last log
record written 540, and the last log record that has not yet
been undone 550.
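
As an illustration only, the dirty page table bookkeeping described above can be sketched in Python. The class and method names (DirtyPageTable, on_update, on_disk_write, checkpoint_image, redo_start) are assumptions made for this sketch and are not taken from ARIES or from this description; LSNs are modeled as plain integers.

```python
from typing import Dict, Optional


class DirtyPageTable:
    """Toy model of the ARIES dirty page table (page-id -> RecLSN)."""

    def __init__(self) -> None:
        self.rec_lsn: Dict[int, int] = {}

    def on_update(self, page_id: int, next_lsn: int) -> None:
        # A row is added only when a non-dirty page is first modified;
        # later updates to an already dirty page keep the original RecLSN.
        self.rec_lsn.setdefault(page_id, next_lsn)

    def on_disk_write(self, page_id: int) -> None:
        # Once the page has been written to disk its entry is deleted.
        self.rec_lsn.pop(page_id, None)

    def checkpoint_image(self) -> Dict[int, int]:
        # Copy of the table as it would be stored in a checkpoint record.
        return dict(self.rec_lsn)

    def redo_start(self) -> Optional[int]:
        # Minimum RecLSN: the earliest change possibly missing from disk.
        return min(self.rec_lsn.values()) if self.rec_lsn else None


dpt = DirtyPageTable()
dpt.on_update(page_id=7, next_lsn=100)
dpt.on_update(page_id=9, next_lsn=120)
dpt.on_update(page_id=7, next_lsn=130)  # page 7 is already dirty
dpt.on_disk_write(9)
```

In this usage, page 7 keeps the RecLSN of its first modification, so the forward pass would begin no later than that log position.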

The four aspects will now be described that were
identified above (namely, changes to the ARIES log,
normal operations, the node manager, and recovery).
For the purpose of this description, we will focus on a
particular database partition that has two replicas: R1
and R2. Anyone of ordinary skill would be able to
extend this easily to a system having more than two
replicas for each partition. One of the two replicas plays the
role of the primary (P) and the other the role of the
secondary (S) for this partition.

Design of the Unified Log 

ARIES records information about which pages are in
volatile storage (main memory) only in its checkpoint
log records. All other log records are independent of
the main memory state (or disk state). Since the disk
states of the two replicas R1 and R2 will in general be
different, the ARIES checkpoint must be modified to
reflect this. All other log records will apply equally
well to both R1 and R2.

The primary P maintains more information in its
DPT than when only one replica is present in a system.
Effectively two DPTs are maintained, a DPTR1 and a
DPTR2 (FIG. 6). It should be apparent that the two
DPTs maintained by P can be unified in various ways
to more compactly store the same information. For Ri
(i = 1, 2), the DPTs maintained by P contain a row 640,
680 for each page that is potentially dirty and not yet on
Ri's disk. For pages that have just been updated on P, an
entry will exist in both DPTs. However, for pages that
have been forced to P's disk and not yet to S's disk,
there will be an entry only in the DPT of Rk where Rk
is playing the role of S.

Thus, whenever a clean page is brought into the
memory of P for updating, an entry is added to both of
the DPTs with the current RecLSN value. When a
page is to be forced to P's disk, it is simultaneously
spooled to S's disk if it has not yet been done. Whenever
the disk write completes at Ri (i = 1, 2) and P receives
the acknowledgment of the same, the corresponding
entry in DPTRi can be deleted.

After a disk write for a page is started at P and the
page is shipped to S, the page might need to be updated
again at P. This results in two entries for the same page
in the DPT of each replica for which the disk write is
not yet complete. This is more likely for S since there is
a round-trip message delay in addition to the disk write
latency. As a consequence, the acknowledgments from
S (and local disk writes) should not only reflect the
PageId, but also the RecLSN number of the received
page, so that P can delete the appropriate entry in
DPTRi. Finally, the recovery procedure must use the
entry with the smallest RecLSN for recovery purposes.
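
The handling of two entries for the same page can be illustrated with a small sketch. It is a hypothetical model, not the implementation of the preferred embodiment: the class ReplicaDPTs, its method names, and the explicit "in flight" set used to decide when a second row is needed are all assumptions introduced here.

```python
from typing import Dict, Optional, Set, Tuple

Row = Tuple[int, int]  # (page_id, rec_lsn)


class ReplicaDPTs:
    """Toy model of the per-replica DPTs kept by the primary P."""

    def __init__(self, replicas: Tuple[str, ...] = ("R1", "R2")) -> None:
        self.rows: Dict[str, Set[Row]] = {r: set() for r in replicas}
        # Rows whose disk write (or spool to S) has started but is unacknowledged.
        self.in_flight: Dict[str, Set[Row]] = {r: set() for r in replicas}

    def on_update(self, page_id: int, next_lsn: int) -> None:
        for r, rows in self.rows.items():
            pending = {row for row in rows if row[0] == page_id} - self.in_flight[r]
            if not pending:
                # Page is clean for this replica, or its only row is in flight.
                rows.add((page_id, next_lsn))

    def start_write(self, replica: str, page_id: int) -> Set[Row]:
        started = {row for row in self.rows[replica] if row[0] == page_id}
        started -= self.in_flight[replica]
        self.in_flight[replica] |= started
        return started

    def on_ack(self, replica: str, page_id: int, rec_lsn: int) -> None:
        # The acknowledgment names both PageId and RecLSN, so only the
        # matching row is deleted.
        self.rows[replica].discard((page_id, rec_lsn))
        self.in_flight[replica].discard((page_id, rec_lsn))

    def recovery_rec_lsn(self, replica: str, page_id: int) -> Optional[int]:
        lsns = [lsn for pid, lsn in self.rows[replica] if pid == page_id]
        return min(lsns) if lsns else None


dpts = ReplicaDPTs()
dpts.on_update(5, 100)        # clean page updated: a row in both DPTs
dpts.start_write("R1", 5)     # forced to P's disk ...
dpts.start_write("R2", 5)     # ... and spooled to S
dpts.on_update(5, 140)        # updated again before either write completes
dpts.on_ack("R1", 5, 100)     # the local disk write finishes first
```

After these steps the table for R2 still holds both rows for page 5, and recovery for R2 would use the smaller RecLSN, while the table for R1 holds only the newer row.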

If a current S is down, in effect, acknowledgements
from S for disk writes are never received at P. Thus, if
one were prepared to write increasingly larger DPTRk
(if Rk is playing the role of S), then no changes need to
be made to the procedure. Let us define SDownCP to
be the checkpoint immediately preceding the time when
S went down. Then, an effective way to bound the size
of the DPT for the secondary is for P to write a pointer
to SDownCP in place of the DPT for the secondary in
all subsequent CP's. In effect, P behaves as if it is the
only replica, and hence writes only its own DPT. When
S recovers after being down, it starts the analysis phase
of ARIES from SDownCP. If desired, a CP can be
taken immediately when S goes down, and this CP will
become SDownCP.

Failure-Free Operation 

Database operations are executed only on the pri- 
mary replica. In order to keep replicas P and S reason- 
ably synchronized with respect to the database state, 
updated pages are sent to S at some time before they are 
discarded from P's buffer. There are a number of poli- 
cies possible for doing this. They involve different 
trade-offs between recovery time and CPU, disk and 
network overheads during failure-free processing de- 
pending on how soon after update (i.e., dirty) pages are 
sent to S. The only criterion needed for correctness is 
that the Write-Ahead log protocol be used, i.e. an up- 
dated page be sent to S only after the log record de- 
scribing the update is written to the log. The set of 
updated pages is termed SDP, for Stream of Dirty 
Pages. Furthermore, S acknowledges to P when it 
writes dirty pages to its disk (it may buffer pages in its 
own memory for faster recovery). 

Meanwhile, all log records are written to a log server. 
The log represents the disk state of both the primary 
and the secondary in a unified fashion, enabling either 
replica to recover to the latest transaction consistent 
state when required. 

The primary carries out the following procedure 
(described in Pseudo-Code) to update the secondary 
replica: 



For each dirty page in buffer
{
    if (S is up)
    {
        Send page to S after latest log record modifying
        the page is written out and before the page is
        expelled from the buffer.
        After receiving acknowledgement from S that page is on
        disk delete entry for page from DPT for S.
    }
    else
        /* Do nothing, i.e. P behaves as if it is the only replica */
}

At checkpoint time do:
    if (S is up)
    {
        write DPT of both P and S in checkpoint.
        inform S that a checkpoint has taken place.
        /* For S to reset the "Received List" described later */
    }
    else
        write DPT of P and a pointer to the latest SDownCP.
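
The two decisions made by the primary in this procedure can be sketched as small Python functions. This is a sketch under stated assumptions: the names may_spool, checkpoint_record, flushed_lsn and the dictionary layout of the checkpoint record are hypothetical, and LSNs are assumed to be integers that increase in log order.

```python
from typing import Dict, Optional


def may_spool(last_update_lsn: int, flushed_lsn: int) -> bool:
    """Write-Ahead Log rule: a page may be sent to S only after the latest
    log record modifying it has been written to the log.

    flushed_lsn (an assumption of this sketch) is the LSN of the last log
    record known to be on the log server.
    """
    return last_update_lsn <= flushed_lsn


def checkpoint_record(dpt_p: Dict[int, int],
                      dpt_s: Dict[int, int],
                      s_is_up: bool,
                      sdown_cp: Optional[int]) -> Dict[str, object]:
    """Build the content of a checkpoint record written by P."""
    if s_is_up:
        # Both DPTs are written while S is up.
        return {"dpt_p": dict(dpt_p), "dpt_s": dict(dpt_s)}
    # S is down: write P's DPT and only a pointer to SDownCP for S.
    return {"dpt_p": dict(dpt_p), "sdown_cp": sdown_cp}


record_up = checkpoint_record({1: 10}, {1: 10, 2: 20}, True, None)
record_down = checkpoint_record({1: 10}, {1: 10, 2: 20}, False, 55)
```

Writing the pointer instead of the DPT for S keeps the checkpoint record from growing while S is down, which is the bound on DPT size described earlier.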



The Node Manager 

A mechanism is required in accordance with this
invention to ensure that inconsistent actions do not take
place as a consequence of network partitions, lost
messages or other errors. For example, two nodes which
both have copies of a given partition should not both
decide to take over the primary role. We call this the
node manager function.

This mechanism is best implemented at the log server
node since it can enforce its view of the system state by
not allowing the "wrong" node to write log records. A
node which cannot write log records cannot commit
any transaction and hence can do no harm. In our
preferred scheme, the node manager keeps track of the
state of each partition. If a primary fails, it asks the
secondary to take over the role of primary after
recovering its database state. If a secondary fails, it asks the
primary to record SDownCP.

The node manager is informed by every node when it
recovers after a failure. Based on a state table that the
node manager keeps for every partition, it sends this
node a message asking it to recover. The state table
records the information shown in FIG. 7 for each
partition, which includes the following fields:

R1 720, which identifies the location of the first
replica;

R2 730, which identifies the location of the second
replica;

P 760, which identifies the current primary node (one
of R1 or R2) and therefore also identifies the current
secondary node S as the node different from P;

StateR1 740, which records whether the replica R1 is
up or down;

StateR2 750, which records whether the replica R2 is
up or down.
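
A minimal sketch of one row of this state table and of the node manager's reaction to a failure is given below. The field names follow FIG. 7, but the class PartitionState, the function on_failure, the request names it returns, and the handling of the case in which both replicas are down are assumptions of this sketch and are not specified above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PartitionState:
    """One row of the node manager's table (fields as in FIG. 7)."""
    r1: str              # location of the first replica
    r2: str              # location of the second replica
    primary: str         # "R1" or "R2"; the other replica is the secondary
    state_r1: str = "up"
    state_r2: str = "up"

    @property
    def secondary(self) -> str:
        return "R2" if self.primary == "R1" else "R1"

    def state_of(self, replica: str) -> str:
        return self.state_r1 if replica == "R1" else self.state_r2

    def mark(self, replica: str, state: str) -> None:
        if replica == "R1":
            self.state_r1 = state
        else:
            self.state_r2 = state


def on_failure(row: PartitionState, failed: str) -> Tuple[str, str]:
    """Return (request, target replica) issued after `failed` goes down."""
    row.mark(failed, "down")
    if failed == row.primary:
        survivor = row.secondary
        if row.state_of(survivor) == "down":
            # Assumption: this case is not covered by the description.
            return ("partition_unavailable", "")
        row.primary = survivor
        return ("recover_and_take_over", survivor)
    # A secondary failed: the primary is asked to record SDownCP.
    return ("record_sdowncp", row.primary)


row = PartitionState(r1="node-a", r2="node-b", primary="R1")
request = on_failure(row, "R2")
```

Keeping the role assignment in a single table at one node is what prevents two replicas from both deciding to act as primary.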

Recovery 

A node (say R1) is asked to recover in one of the
following two scenarios:

(1) It comes back up after a failure:
It first informs the node manager, which then decides
the role assignment for R1. If it happens to be the
secondary, then the node manager asks the
corresponding primary to log an SUp record. The node
manager then asks R1 to recover, either till SUp, or
till the end of the log.

(2) It is asked to take over the role of the primary:
In this case, the node manager asks it to catch up and
recover to the end of the log.

The actions taken by R1 during recovery are
described in Pseudo-Code as follows, which should be
self-explanatory:






Request last checkpoint from log server.
if (checkpoint has DPT for R1)
{
    Start ARIES analysis pass from this DPT.
    Delete pages from this DPT which
    occur in "Received List".
}
else
{
    Follow pointer to SDownCP.
    Start ARIES analysis pass from this point and
    reconstruct (conservative) DPT before failure.
}
Perform REDO and UNDO ARIES passes and complete
recovery.



The "Received List" is a list of pages S received from 
the primary after the last checkpoint. This list is reset to 
zero after S receives the information that P has taken a 
checkpoint. If S is taking over, then after finishing the 
analysis pass, pages in this list should be deleted from 
the reconstructed DPT. This helps reduce the size of 
the DPT by eliminating those pages that occur in the 
DPT, but have been received since the last CP. 
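
The choice of starting point for the analysis pass and the pruning by the "Received List" can be sketched as follows. The function names analysis_plan and prune_received and the dictionary layout of the checkpoint record are assumptions made for illustration; the analysis, redo and undo passes themselves are not modeled.

```python
from typing import Dict, Set, Tuple, Union

Dpt = Dict[int, int]  # page_id -> RecLSN


def analysis_plan(checkpoint: dict, replica: str) -> Tuple[str, Union[Dpt, int]]:
    """Decide where the recovering replica starts its analysis pass."""
    dpt = checkpoint.get("dpt", {}).get(replica)
    if dpt is not None:
        # The last checkpoint holds a DPT for this replica.
        return ("from_checkpoint_dpt", dict(dpt))
    # Otherwise follow the pointer to SDownCP and start from there.
    return ("from_sdown_cp", checkpoint["sdown_cp"])


def prune_received(dpt: Dpt, received_list: Set[int]) -> Dpt:
    """Drop pages that S received from the primary since the last checkpoint."""
    return {pid: lsn for pid, lsn in dpt.items() if pid not in received_list}


cp_up = {"dpt": {"R1": {1: 10, 2: 20}, "R2": {1: 10, 2: 20, 3: 30}}}
cp_down = {"dpt": {"R1": {1: 10, 2: 20}}, "sdown_cp": 400}

plan_up = analysis_plan(cp_up, "R2")
plan_down = analysis_plan(cp_down, "R2")
pruned = prune_received({1: 10, 2: 20, 3: 30}, {2, 3})
```

A smaller DPT after pruning means fewer pages have to be examined during the redo pass.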

When a primary fails, and a secondary is recovering,
it must read pages in the DPT from disk to apply log
records to it during the redo and undo phases of ARIES.
The disk I/O would not have to be performed if
the secondary buffered hot and often updated pages in
main memory. The secondary can get hints from the
primary about which pages are good candidates for
such buffering. The primary could keep information
about how long each page has been in its buffer and thus
provide such hints.
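
One possible way of deriving such hints is sketched below. The function buffering_hints and its parameters are hypothetical, and treating the pages with the longest buffer residency as the best candidates is an assumption of this sketch rather than a rule stated above.

```python
from typing import Dict, List


def buffering_hints(residency_seconds: Dict[int, float], k: int) -> List[int]:
    """Return up to k page ids, longest buffer residency first.

    Ties are broken by page id so that the result is deterministic.
    """
    ranked = sorted(residency_seconds.items(),
                    key=lambda item: (-item[1], item[0]))
    return [page_id for page_id, _ in ranked[:k]]


hints = buffering_hints({1: 5.0, 2: 90.0, 3: 40.0}, k=2)
```

The primary could send such a list along with the stream of dirty pages, and the secondary would remain free to ignore it.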

While the invention has been particularly shown and 
described with reference to a preferred embodiment
thereof, it will be understood by those skilled in the art
that various changes in form and detail may be made 
without departing from the spirit and scope of the in- 
vention. 

Having thus described our invention what we claim 
as new and desire to secure as Letters Patent, is: 

1. An improved transaction processing system of the 
type which includes: 

a plurality of transaction processors connected by a 
communication network, each of said processors 
having its own private storage; 

said transaction processing system having at least one 
partition of data which is subpartitioned into pages, 
said partition of data including a primary replica of 
said partition and a secondary replica of said parti- 
tion, said primary replica being stored in said pri- 
vate storage of one of said processors and said 
secondary replica being stored in said private stor- 
age of another one of said processors; and 

a non-volatile log, said transaction processing system 
using a Write-Ahead Log protocol in which a mod- 
ification to a page in said partition is not considered 
made until said modification is stored in said log; 

wherein the improvement comprises: 

means for generating a response indicating to a trans- 
action requestor that a requested transaction on the 
primary replica has completed; 

means for updating said secondary replica from said 
primary replica by asynchronously sending modi- 
fied pages from said primary replica to said second- 
ary replica independently of the generation of the 
response by the means for generating and without 
imposing any page sending order or timing con- 
straint; 

means for keeping track of the pages of said partition 
which have been modified in said primary replica 
and not yet modified in said secondary replica; and 

means for making said log accessible to both said 
processors which are storing a replica of said parti- 
tion, 
whereby said secondary replica is maintained without 
imposing delay upon the processing of transactions 
on said primary replica by said one processor. 

2. A transaction processing system as defined in claim 
1 wherein said transaction processing system has a 
plurality of partitions of data, each of said plurality of 
partitions of data including a primary replica and at 
least one secondary replica, each of said replicas of a 
same partition being stored in different ones of said 
private storages. 

3. A transaction processing system as defined in claim 
2 wherein said partitions of data form a distributed 
database of the Shared Nothing type. 

4. A transaction processing system as defined in claim 
2 and further comprising a node manager for identifying 
the particular processor which is deemed to have 
said primary replica of any particular partition in its 
private storage and for identifying each particular 
processor which has one of said at least one secondary 
replica of said particular partition in its private storage. 

5. A transaction processing system as defined in claim 
4 wherein said node manager tracks the operational 
status of each of said replicas and in the event of failure 
of the primary replica of any partition identifies a 
non-failed secondary replica of said any partition as the 
primary replica. 

6. A transaction processing system as defined in claim 
2 wherein transactions are completed only against the 
particular replica of any partition which is identified by 
said node manager as the primary replica. 

7. A transaction processing system as defined in claim 
2 wherein any one replica of any one or more partitions 
can fail without thereby causing said transaction 
processing system to fail. 

8. A transaction processing system as defined in claim 
7 wherein a failed replica can be restored from said log. 

9. A transaction processing system as defined in claim 
7 wherein a failed replica of a partition can be restored 
from another replica of said partition. 

10. A transaction processing system as defined in 
claim 1 wherein each of said private storages includes a 
non-volatile portion and a volatile portion, said replica 
of a partition stored in each of said private storages 
being permanently stored in said non-volatile portion 
thereof, individual pages of said primary replica being 
fetched to said volatile portion of said private storage of 
said one processor for use, at least some of said fetched 
pages becoming modified and being forced back to said 
non-volatile portion of said private storage of said one 
processor. 

11. A transaction processing system as defined in 
claim 10 wherein said means for updating includes 
means for sending a modified page of said primary 
replica from said volatile private storage portion of said 
one processor to said volatile private storage portion of 
said another processor for transfer into said non-volatile 
storage portion of said another processor. 
12. A transaction processing system as defined in 
claim 11 wherein said means for updating further 
comprises means for acknowledging to said one processor 
that a modified page of said primary replica sent from 
said volatile private storage portion of said one 
processor has been stored in said non-volatile storage portion 
of said another processor. 

13. A transaction processing system as defined in 
claim 11 wherein said modified page is transferred into 
said non-volatile storage portion of said another proces- 
sor before said modified page is forced back to said 
non-volatile private storage portion of said one proces- 
sor. 

14. A transaction processing system as defined in 
claim 11 wherein said modified page is sent to said pri- 
vate storage of said another processor when said modi- 
fied page is forced back to said non-volatile private 
storage portion of said one processor. 

15. A transaction processing system as defined in 
claim 11 wherein said means for keeping track includes 
table means for identifying pages of said partition that 
have been modified and not yet stored in said 
non-volatile private storage portion of said another 
processor. 

16. A transaction processing system as defined in 
claim 15 wherein said table means includes a first dirty 
page table for identifying modified pages of said parti- 
tion that have not yet been stored in said non-volatile 
private storage portion of said one processor and a 
second dirty page table for identifying modified pages 
of said partition that have not yet been stored in said 
non-volatile private storage portion of said another 
processor. 

17. A transaction processing system as defined in 
claim 1 wherein said log is stored in a log server. 

18. A transaction processing system as defined in 
claim 17 wherein said means for making said log acces- 
sible includes means connecting said log server to said 
transaction processors. 

19. A transaction processing system as defined in 
claim 18 wherein said means for making said log acces- 
sible includes means connecting said log server to said 
communication system. 

20. A transaction processing system as defined in 
claim 1 wherein said primary replica can be updated by 
multiple transactions asynchronously with the updating 
of the secondary replica. 

21. A transaction processing system as defined in 
claim 1 wherein said private storage of said one proces- 
sor includes volatile private storage and non-volatile 
private storage and said means for updating includes 
means for sending a modified page to said private stor- 
age of said another processor in response to said modi- 
fied page being sent from said volatile private storage of 
said one processor to said non-volatile private storage 
of said one processor. 
